Rui CHEN Ying TONG Ruiyu LIANG
Deep neural networks have achieved great success in visual tracking by learning a generic representation and leveraging large amounts of training data to improve performance. Most generic object trackers are trained from scratch online and do not benefit from the large number of videos available for offline training. We present a real-time generic object tracker that incorporates temporal information into its model, learns from many examples offline, and updates quickly online. During training, the pre-trained weights of the convolutional layers are updated with a lag, and the length of the input video sequence is gradually increased for fast convergence. Furthermore, only the hidden states of the recurrent network are updated to guarantee real-time tracking speed. Experimental results show that the proposed method tracks objects at 150 fps with a higher predicted overlap rate and is more robust on multiple benchmarks than state-of-the-art trackers.
This letter presents a novel technique for fast inference in binarized convolutional neural networks (BCNNs). The proposed technique modifies the structure of the constituent blocks of the BCNN model so that the input elements of the max-pooling operation are binary. In this structure, if any input element is +1, the pooling result can be produced immediately; the proposed technique therefore skips the computations needed to obtain the remaining input elements, which effectively reduces the inference time. The proposed technique reduces the inference time by up to 34.11% while maintaining the classification accuracy.
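The early-exit idea can be sketched as follows. This is only an illustration under assumptions that are not in the abstract: each pooling input is modeled as a hypothetical zero-argument callable in `producers` that returns +1 or -1 when evaluated, standing in for the binarized output at one position of the pooling window.

```python
from typing import Callable, Sequence, Tuple


def lazy_binary_max_pool(producers: Sequence[Callable[[], int]]) -> Tuple[int, int]:
    """Max-pool over binary (+1/-1) inputs, evaluating them lazily.

    Returns the pooled value and how many inputs were actually computed.
    """
    evaluated = 0
    for produce in producers:
        evaluated += 1
        if produce() == 1:
            # +1 is the largest possible value, so the result is already known
            # and the remaining inputs never need to be computed.
            return 1, evaluated
    # Every input was -1, so all of them had to be evaluated.
    return -1, evaluated


# Hypothetical 2x2 pooling window, flattened.
window = [lambda: -1, lambda: 1, lambda: -1, lambda: -1]
pooled, cost = lazy_binary_max_pool(window)
```

In this toy window the second input is +1, so the last two inputs are never computed. The saving depends on how early a +1 appears in the window.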
Music classification has been inspired by the remarkable success of deep learning. To improve efficiency while maintaining high performance, a hybrid architecture that combines deep learning and Broad Learning (BL) is proposed for music classification tasks. At the feature extraction stage, a Random CNN (RCNN) is adopted to analyze the Mel-spectrogram of the input music signal. Compared with a conventional CNN, the RCNN has a more flexible structure that adapts to the variance contained in different types of music. At the prediction stage, the BL technique is introduced to improve the prediction accuracy and reduce the training time. Experimental results on three benchmark datasets (GTZAN, Ballroom, and Emotion) demonstrate that: i) the proposed scheme achieves higher classification accuracy than a deep learning based scheme that combines CNN and LSTM on all three benchmark datasets; ii) both RCNN and BL contribute to the performance improvement of the proposed scheme; iii) the introduction of BL also improves the prediction efficiency of the proposed scheme.
Handwritten numeral recognition is a classical and important task in computer vision. We propose two novel deep learning models for this task, which combine an edge extraction method with Siamese/Triple network structures. We evaluate the models on seven handwritten numeral datasets, and the results demonstrate both the simplicity and the effectiveness of our models compared with baseline methods.
Wenli ZHU Min ZHANG Chenxi WU Lingqing ZENG
A convolutional neural network (CNN) for broadband direction of arrival (DOA) estimation of far-field electromagnetic signals is presented. The proposed algorithm performs a nonlinear inverse mapping from the received signal to the angle of arrival. The signal model used for the algorithm is based on a circular antenna array geometry, and the phase component extracted from the spatial covariance matrix is used as the input of the CNN. A CNN model with three convolutional layers is then established to approximate the nonlinear mapping. The performance of the CNN model is evaluated in a noisy environment for various values of signal-to-noise ratio (SNR). The results demonstrate that the proposed CNN model, with the phase component of the spatial covariance matrix as the input, is able to achieve fast and accurate broadband DOA estimation and attains perfect performance at lower SNR values.
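The input feature described above, the phase of the spatial covariance matrix, can be sketched as below. This is a minimal illustration, not the authors' pipeline: the function name `covariance_phase`, the snapshot layout, and the use of the plain sample covariance are assumptions, and the circular array geometry and the CNN itself are not modeled.

```python
import cmath
from typing import List, Sequence


def covariance_phase(snapshots: Sequence[Sequence[complex]]) -> List[List[float]]:
    """Phase (in radians) of the sample spatial covariance matrix.

    snapshots[t][i] is the complex sample received by antenna i at time t.
    """
    num_snapshots = len(snapshots)
    num_antennas = len(snapshots[0])
    phases = []
    for i in range(num_antennas):
        row = []
        for j in range(num_antennas):
            # R[i][j] is the time average of x_i(t) * conj(x_j(t)).
            r_ij = sum(x[i] * x[j].conjugate() for x in snapshots) / num_snapshots
            row.append(cmath.phase(r_ij))
        phases.append(row)
    return phases


# Toy example: one source whose signal reaches three antennas with
# hypothetical phase offsets of 0.0, 0.5 and 1.0 rad.
offsets = [0.0, 0.5, 1.0]
source = [1 + 0j, 0.5j, -2 + 1j]
toy_snapshots = [[s * cmath.exp(1j * p) for p in offsets] for s in source]
toy_phases = covariance_phase(toy_snapshots)
```

For a single noise-free source, entry (i, j) of the result equals the phase difference between antennas i and j, which is the information a DOA estimator needs. The amplitude of the source drops out.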
Andros TJANDRA Sakriani SAKTI Satoshi NAKAMURA
Recurrent neural networks (RNNs) have achieved state-of-the-art performance on various complex tasks involving temporal and sequential data. However, most of these RNNs require substantial computational power and a huge number of parameters for both the training and inference stages. We apply several tensor decomposition methods, namely CANDECOMP/PARAFAC (CP), Tucker decomposition, and Tensor Train (TT), to re-parameterize the Gated Recurrent Unit (GRU) RNN. First, we evaluate the performance of all tensor-based RNNs on sequence modeling tasks with various numbers of parameters. In our experiments, TT-GRU achieved the best results across the various numbers of parameters compared with the other decomposition methods. We then evaluate our proposed TT-GRU on a speech recognition task, in which we compress the bidirectional GRU layers inside the DeepSpeech2 architecture. Our experimental results show that the proposed TT-format GRU is able to preserve the performance while significantly reducing the number of GRU parameters compared with the uncompressed GRU.
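To give a feel for why the Tensor Train format reduces the parameter count, the sketch below counts parameters for a weight matrix stored in the usual TT-matrix layout, where core k has shape (r_{k-1}, m_k, n_k, r_k). The mode sizes and ranks in the example are hypothetical and are not taken from the paper.

```python
from math import prod
from typing import Sequence


def dense_param_count(in_modes: Sequence[int], out_modes: Sequence[int]) -> int:
    """Parameters of an ordinary dense matrix of size prod(in) x prod(out)."""
    return prod(in_modes) * prod(out_modes)


def tt_param_count(in_modes: Sequence[int], out_modes: Sequence[int],
                   ranks: Sequence[int]) -> int:
    """Parameters of the same matrix in TT-matrix format.

    Core k holds ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1] values.
    """
    if len(in_modes) != len(out_modes) or len(ranks) != len(in_modes) + 1:
        raise ValueError("need one rank more than the number of modes")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
               for k in range(len(in_modes)))


# Hypothetical 256 x 256 weight matrix factored as 4*4*4*4 by 4*4*4*4.
dense = dense_param_count((4, 4, 4, 4), (4, 4, 4, 4))           # 65536
compressed = tt_param_count((4, 4, 4, 4), (4, 4, 4, 4),
                            (1, 3, 3, 3, 1))                    # 384
```

A GRU has several such weight matrices (for its gates and candidate state), so the same counting applies to each of them. How the accuracy behaves at a given rank is an empirical question that this count does not answer.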
Joanna Kazzandra DUMAGPI Woo-Young JUNG Yong-Jin JEONG
Threat object recognition in x-ray security images is an important practical application of computer vision. However, research in this field has been limited by the lack of available datasets that mirror the practical setting of such applications. In this paper, we present a novel GAN-based anomaly detection (GBAD) approach as a solution to the extreme class-imbalance problem in multi-label classification. This method helps suppress the surge in false positives induced by training a CNN on a non-practical dataset. We evaluate our method on a large-scale x-ray image database to closely emulate practical scenarios in port security inspection systems. Experiments demonstrate an improvement over the existing algorithm.
Chihiro WATANABE Kaoru HIRAMATSU Kunio KASHINO
Interpretability has become an important issue in the machine learning field, along with the success of layered neural networks in various practical tasks. Since a trained layered neural network consists of a complex nonlinear relationship among a large number of parameters, it has been difficult to understand how it achieves input-output mappings for a given data set. In this paper, we propose the non-negative task matrix decomposition method, which applies non-negative matrix factorization to a trained layered neural network. This enables us to decompose the inference mechanism of a trained layered neural network into multiple principal tasks of input-output mapping, and to reveal the roles of hidden units in terms of their contribution to each principal task.
In this letter, the performance of a state-of-the-art deep learning (DL) algorithm in [5] is analyzed and evaluated for orthogonal frequency-division multiplexing (OFDM) receivers, in the presence of harmonic spur interference. Moreover, a novel spur cancellation receiver structure and algorithm are proposed to enhance the traditional OFDM receivers, and serve as a performance benchmark for the DL algorithm. It is found that the DL algorithm outperforms the traditional algorithm and is much more robust to spur carrier frequency offset.
Jiateng LIU Wenming ZHENG Yuan ZONG Cheng LU Chuangao TANG
In this letter, we propose a novel deep domain-adaptive convolutional neural network (DDACNN) model to handle the challenging cross-corpus speech emotion recognition (SER) problem. The framework of the DDACNN model consists of two components: a feature extraction model based on a deep convolutional neural network (DCNN) and a domain-adaptive (DA) layer added to the DCNN that uses the maximum mean discrepancy (MMD) criterion. We use labeled spectrograms from the source speech corpus combined with unlabeled spectrograms from the target speech corpus as the input of two classic DCNNs to extract the emotional features of speech, and train the model with a mixed loss that combines a cross-entropy loss and an MMD loss. Compared with other classic cross-corpus SER methods, the major advantage of the DDACNN model is that it extracts robust time-frequency speech features from spectrograms and narrows the discrepancy between the feature distributions of the source and target corpora to obtain better cross-corpus performance. In several cross-corpus SER experiments, our DDACNN achieved state-of-the-art performance on three public emotional speech corpora and was shown to handle the cross-corpus SER problem efficiently.
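The mixed loss described above can be sketched as a cross-entropy term on labeled source samples plus a weighted MMD term between source and target features. The abstract does not state the kernel or the weighting, so this sketch assumes a linear kernel (for which squared MMD reduces to the squared distance between feature means) and a hypothetical `weight` parameter; all function names are illustrative.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def linear_mmd2(source: Sequence[Sequence[float]],
                target: Sequence[Sequence[float]]) -> float:
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2."""
    dim = len(source[0])
    mean_s = [sum(row[d] for row in source) / len(source) for d in range(dim)]
    mean_t = [sum(row[d] for row in target) / len(target) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))


def mixed_loss(source_probs: Sequence[Sequence[float]],
               source_labels: Sequence[int],
               source_feats: Sequence[Sequence[float]],
               target_feats: Sequence[Sequence[float]],
               weight: float = 1.0) -> float:
    """Mean cross-entropy on labeled source data plus weighted MMD penalty."""
    ce = sum(cross_entropy(p, y) for p, y in zip(source_probs, source_labels))
    ce /= len(source_labels)
    return ce + weight * linear_mmd2(source_feats, target_feats)


# Toy batch: one labeled source sample, one unlabeled target sample.
loss = mixed_loss([[0.5, 0.5]], [0], [[0.0, 0.0]], [[3.0, 4.0]], weight=0.1)
```

The MMD term needs no target labels, which is what lets unlabeled target spectrograms influence training. A Gaussian or multi-kernel MMD would replace `linear_mmd2` without changing the structure of `mixed_loss`.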
Mahmud Dwi SULISTIYO Yasutomo KAWANISHI Daisuke DEGUCHI Ichiro IDE Takatsugu HIRAYAMA Jiang-Yu ZHENG Hiroshi MURASE
Numerous applications such as autonomous driving, satellite imagery sensing, and biomedical imaging use computer vision as an important tool for perception tasks. Intelligent Transportation Systems (ITS) are required to precisely recognize and locate scenes in sensor data. Semantic segmentation is one of the computer vision methods intended to perform such tasks. However, existing semantic segmentation tasks label each pixel with only a single object class. Recognizing object attributes, e.g., pedestrian orientation, is more informative and helps achieve better scene understanding. Thus, we propose a method that performs semantic segmentation and pedestrian attribute recognition simultaneously. We introduce an attribute-aware loss function that can be applied to an arbitrary base model. Furthermore, a re-annotation of the existing Cityscapes dataset enriches the ground-truth labels by annotating the attributes of pedestrian orientation. We implement the proposed method and compare the experimental results with others. The attribute-aware semantic segmentation is shown to outperform baseline methods in both the traditional object segmentation task and the expanded attribute detection task.
Ippei HAMAMOTO Masaki KAWAMURA
We have developed a digital watermarking method that uses neural networks to learn embedding and extraction processes that are robust against rotation and JPEG compression. The proposed neural networks consist of a stego-image generator, a watermark extractor, a stego-image discriminator, and an attack simulator. The attack simulator consists of a rotation layer and an additive noise layer, which simulate the rotation attack and the JPEG compression attack, respectively. The stego-image generator can learn embedding that is robust against these attacks, and the watermark extractor can extract watermarks without rotation synchronization. The quality of the stego-images can be improved by using the stego-image discriminator, which is a type of adversarial network. We evaluated the robustness of the watermarks and the image quality and found that, using the proposed method, high-quality stego-images could be generated and the neural networks could be trained to embed and extract watermarks that are robust against rotation and JPEG compression attacks. We also showed that the robustness and image quality can be adjusted by changing the noise strength in the noise layer.
The spectrum sensing of the orthogonal frequency division multiplexing (OFDM) system in cognitive radio (CR) has always been challenging, especially for user terminals that utilize the full-duplex (FD) mode. We herein propose an advanced FD spectrum-sensing scheme that can be successfully performed even when severe self-interference is encountered from the user terminal. Based on the “classification-converted sensing” framework, the cyclostationary periodogram generated by OFDM pilots is exhibited in the form of images. These images are subsequently plugged into convolutional neural networks (CNNs) for classifications owing to the CNN's strength in image recognition. More importantly, to realize spectrum sensing against residual self-interference, noise pollution, and channel fading, we used adversarial training, where a CR-specific, modified training database was proposed. We analyzed the performances exhibited by the different architectures of the CNN and the different resolutions of the input image to balance the detection performance with computing capability. We proposed a design plan of the signal structure for the CR transmitting terminal that can fit into the proposed spectrum-sensing scheme while benefiting from its own transmission. The simulation results prove that our method has excellent sensing capability for the FD system; furthermore, our method achieves a higher detection accuracy than the conventional method.
Yun ZHANG Bingrui LI Shujuan YU Meisheng ZHAO
In this paper, we propose a new scheme that uses a blind detection algorithm to recover the conventional user signal in a system in which sporadic machine-to-machine (M2M) communication shares the same spectrum with the conventional user. Compressive sensing techniques are used to estimate the M2M device signals. The blind detection algorithm, which is based on the Hopfield neural network (HNN), is then used to recover the conventional user signal. The simulation results show that the conventional user signal can be effectively restored under an unknown channel. Compared with existing methods, such as those that use a training sequence to estimate the channel in advance, the blind detection algorithm used in this paper does not need to identify the channel and can detect the transmitted signal directly.
Kota ANDO Kodai UEYOSHI Yuka OBA Kazutoshi HIROSE Ryota UEMATSU Takumi KUDO Masayuki IKEBE Tetsuya ASAI Shinya TAKAMAEDA-YAMAZAKI Masato MOTOMURA
Deep neural networks (NNs) have been widely accepted for enabling various AI applications; however, limited computational and memory resources are a major problem on mobile devices. A quantized NN with reduced bit precision is an effective solution that relaxes the resource requirements, but the accuracy degradation caused by its numerical approximation is another problem. We propose a novel quantized NN model that employs the “dithering” technique to improve the accuracy with minimal additional hardware, from the viewpoint of hardware-algorithm co-design. Dithering distributes the quantization error occurring at each pixel (neuron) spatially so that the total information loss of the plane is minimized. Our experiments, which used software-based accuracy evaluation and FPGA-based hardware resource estimation, demonstrated the effectiveness and efficiency of an NN model with dithering.
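The core idea of dithering, passing each position's quantization error on to its neighbors instead of discarding it, can be sketched in one dimension as below. This is a simplification under stated assumptions: the abstract describes spatial diffusion over a two-dimensional plane and does not give the diffusion weights, so the single-neighbor carry and the names `quantize_with_error_diffusion` and `step` are illustrative only.

```python
from typing import List, Sequence


def quantize_plain(values: Sequence[float], step: float = 1.0) -> List[float]:
    """Round every value independently to the nearest multiple of step."""
    return [round(v / step) * step for v in values]


def quantize_with_error_diffusion(values: Sequence[float],
                                  step: float = 1.0) -> List[float]:
    """Quantize left to right, carrying each rounding error to the next value."""
    out = []
    carry = 0.0
    for v in values:
        target = v + carry                  # add the error left over so far
        q = round(target / step) * step     # quantize to the grid
        carry = target - q                  # error to pass to the neighbor
        out.append(q)
    return out


activations = [0.4, 0.4, 0.4, 0.4, 0.4]
plain = quantize_plain(activations)                      # all zeros
dithered = quantize_with_error_diffusion(activations)    # [0, 1, 0, 1, 0]
```

Plain rounding loses the whole signal here (the outputs sum to 0 while the inputs sum to 2.0), whereas the diffused version keeps the sum at 2.0 with the same 1-level grid. That preservation of the aggregate over a neighborhood is the property the abstract refers to as minimizing the total information loss of the plane.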
Xiao-Yi ZHAO Chao-Yi DONG Peng ZHOU Mei-Jia ZHU Jing-Wen REN Xiao-Yan CHEN
This paper employs AlexNet, a deep learning model, to automatically diagnose damage on the surfaces of wind power generator blades. The original images of the blade surfaces were captured by the machine vision system of a 4-rotor UAV (unmanned aerial vehicle). First, an 8-layer AlexNet, comprising 21 functional sub-layers in total, was constructed and parameterized. Second, the AlexNet was trained with 10000 images and then tested over six rounds of 350 images. Finally, the statistics of the network tests show that the average accuracy of damage diagnosis by AlexNet is about 99.001%. We also trained and tested a traditional BP (Back Propagation) neural network, which has a 20-neuron input layer, a 5-neuron hidden layer, and a 1-neuron output layer, with the same image data. The average accuracy of damage diagnosis of the BP neural network is 19.424% lower than that of AlexNet. This shows that it is feasible to apply UAV image acquisition and a deep learning classifier to automatically diagnose damage to wind turbine blades in service.
Kai NAKAMURA Kenta IWAI Yoshinobu KAJIKAWA
In this paper, we propose an automatic design support system for compact acoustic devices such as microspeakers inside smartphones. The proposed design support system outputs the dimensions of a compact acoustic device with the desired acoustic characteristic. This system uses a deep neural network (DNN) to learn the relationship between the frequency characteristic of the compact acoustic device and its dimensions. The training data are generated by the acoustic finite-difference time-domain (FDTD) method so that large amounts of training data can be obtained easily. We demonstrate the effectiveness of the proposed system through comparisons between desired and designed frequency characteristics.
Lianqiang LI Jie ZHU Ming-Ting SUN
Convolutional Neural Networks (CNNs) usually have millions or even billions of parameters, which makes them hard to deploy on mobile devices. In this work, we present a novel filter-level pruning method to alleviate this issue. More concretely, we first construct an undirected fully connected graph to represent a pre-trained CNN model. Then, we employ the spectral clustering algorithm to divide the graph into subgraphs, which is equivalent to clustering similar filters of the CNN into the same groups. After obtaining the grouping relationships among the filters, we keep one filter per group and retrain the pruned model. Compared with previous pruning methods that identify redundant filters heuristically, the proposed method can select the pruning candidates more reasonably and precisely. Experimental results also show that our proposed pruning method achieves significant improvements over state-of-the-art methods.
Guodong SUN Zhen ZHOU Yuan GAO Yun XU Liang XU Song LIN
In this paper, we design a fast fabric defect detection framework (Fast-DDF) based on gray histogram back-projection, which adopts an end-to-end multi-layer convolutional network model for defect classification. First, the back-projection image is built from the gray histogram of the fabric image, and a closing operation and an adaptive threshold segmentation method are applied to screen out impurity information and extract the defect regions. Then, the defect images segmented by the Fast-DDF are labeled, normalized, and fed into the multi-layer convolutional neural network for training. Finally, to address the difficulty of adjusting the network model parameters and the long training time, strategies such as batch normalization of samples and network fine-tuning are proposed. The experimental results on the TILDA database show that our method can deal with various defect types of textile fabrics. The average detection accuracy reaches 96.12% on a database of five different defect types, and detection takes only 0.72 s per image.
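The first stage, gray histogram back-projection, can be sketched in a few lines. This is a simplified illustration rather than the Fast-DDF pipeline: it replaces each pixel with the relative frequency of its gray-level bin, then applies a fixed threshold, whereas the paper uses a closing operation and adaptive threshold segmentation. The bin count, the threshold value, and the function names are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def back_project(image: Sequence[Sequence[int]], bins: int = 16,
                 levels: int = 256) -> List[List[float]]:
    """Replace each pixel by the relative frequency of its gray-level bin."""
    width = levels // bins
    counts = Counter(p // width for row in image for p in row)
    total = sum(counts.values())
    return [[counts[p // width] / total for p in row] for row in image]


def defect_mask(image: Sequence[Sequence[int]],
                threshold: float = 0.1) -> List[List[bool]]:
    """Mark pixels whose gray level is rare in the image as defect candidates."""
    return [[value < threshold for value in row] for row in back_project(image)]


# Toy 4x4 fabric patch: uniform texture with a single bright defect pixel.
patch = [[100] * 4 for _ in range(4)]
patch[2][1] = 250
mask = defect_mask(patch)
```

On regular fabric the dominant gray levels back-project to high values, so pixels with rare gray levels stand out as low values. Real textures would need the morphological cleanup and adaptive thresholding that the abstract mentions before the regions are passed to the classifier.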
Ryo MASUMURA Taichi ASAMI Takanobu OBA Sumitaka SAKAUCHI Akinori ITO
This paper demonstrates latent word recurrent neural network language models (LW-RNN-LMs) for enhancing automatic speech recognition (ASR). LW-RNN-LMs are constructed to combine the advantages of both recurrent neural network language models (RNN-LMs) and latent word language models (LW-LMs). RNN-LMs can capture long-range context information and offer strong performance, and LW-LMs are robust in out-of-domain tasks owing to latent word space modeling. However, RNN-LMs cannot explicitly capture hidden relationships behind observed words because they have no concept of a latent variable space. In addition, LW-LMs cannot take into account long-range relationships between latent words. Our idea is to combine the RNN-LM and the LW-LM so that each compensates for the disadvantages of the other. LW-RNN-LMs support latent variable space modeling, as LW-LMs do, and long-range relationship modeling, as RNN-LMs do, at the same time. From the viewpoint of RNN-LMs, an LW-RNN-LM can be considered a soft class RNN-LM with a vast latent variable space. In contrast, from the viewpoint of LW-LMs, an LW-RNN-LM can be considered an LW-LM that uses the RNN structure for latent variable modeling instead of an n-gram structure. This paper also details a parameter inference method and two implementation methods, an n-gram approximation and a Viterbi approximation, for introducing the LW-LM to ASR. Our experiments show the effectiveness of LW-RNN-LMs in a perplexity evaluation on the Penn Treebank corpus and an ASR evaluation on Japanese spontaneous speech tasks.